Exploratory Data Analysis: Wisconsin Diagnostic Breast Cancer (WDBC)¶
1.1 Introduction¶
This report analyzes the Wisconsin Diagnostic Breast Cancer (WDBC) dataset to identify key features distinguishing malignant from benign tumors. The data features were computed from digitized images of fine needle aspirates (FNA) of breast masses, describing the characteristics of the cell nuclei present in the image (Wolberg et al., 1995).
1.2 Data Acquisition¶
The raw data was retrieved directly from the UCI Machine Learning Repository to ensure reproducibility. The dataset consists of 569 instances with 30 real-valued input features and one binary target variable (Diagnosis).
# Imports (only those necessary for the EDA)
import pandas as pd
import numpy as np
import altair_ally as aly
import altair as alt
alt.data_transformers.enable('vegafusion')
from ucimlrepo import fetch_ucirepo
# import the data
# Code from https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
# Need ucimlrepo package to load the data
raw_data = fetch_ucirepo(id=17)
raw_X = raw_data.data.features
raw_y = raw_data.data.targets
raw_df = pd.concat([raw_X, raw_y], axis=1)
raw_df.to_csv("../data/raw/breast_cancer_raw.csv", index=False)
2. Data Cleaning and Schema Mapping¶
The raw dataset lacks semantic column headers. To facilitate analysis, we implemented a schema mapping strategy based on the wdbc.names metadata. The 30 features represent ten distinct cell nucleus characteristics (e.g., Radius, Texture) computed in three statistical forms.
We applied the following suffix mapping transformation:
- Mean Value: suffix `1` -> `_mean`
- Standard Error: suffix `2` -> `_se`
- Worst (Max) Value: suffix `3` -> `_max`
This step ensures all features are semantically interpretable for the subsequent EDA.
# Data Cleaning
# There is no NA in the dataset
# Clean the column names based on description
clean_columns = []
for col in raw_X.columns:
if col.endswith('1'):
clean_name = col[:-1] + '_mean'
elif col.endswith('2'):
clean_name = col[:-1] + '_se'
elif col.endswith('3'):
clean_name = col[:-1] + '_max'
else:
clean_name = col
clean_columns.append(clean_name)
raw_X.columns = clean_columns
X = raw_X.copy()
# Clean the target column
y = raw_y.copy()
y['Diagnosis'] = y['Diagnosis'].map({'M': 'Malignant', 'B': 'Benign'})
clean_df = pd.concat([X, y], axis=1)
# Export the cleaned data
clean_df.to_csv('../data/processed/breast_cancer_cleaned.csv', index=False)
clean_df
| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_points_mean | symmetry_mean | fractal_dimension_mean | ... | texture_max | perimeter_max | area_max | smoothness_max | compactness_max | concavity_max | concave_points_max | symmetry_max | fractal_dimension_max | Diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | Malignant |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | Malignant |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | Malignant |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | Malignant |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | Malignant |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 | Malignant |
| 565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 | Malignant |
| 566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 | Malignant |
| 567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 | Malignant |
| 568 | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.00000 | 0.00000 | 0.1587 | 0.05884 | ... | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0000 | 0.0000 | 0.2871 | 0.07039 | Benign |
569 rows × 31 columns
3. Data Profiling: Structure and Statistics¶
Purpose:
- `df.info()`: Used to verify data integrity by checking for null values and ensuring all feature columns are of `float64` type.
- `df.describe()`: Used to examine the central tendency and spread of numeric features. This highlights differences in magnitude (scales) across variables.
Observation:
The dataset is complete (no missing values). However, describe() reveals massive scale disparities (e.g., area_mean ranges up to 2500, while smoothness_mean is < 0.1), confirming the necessity for Feature Scaling (Standardization) before modeling.
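The scale gap that motivates standardization can be shown directly: scaling two features of very different magnitude puts them on comparable footing. A minimal sketch using synthetic values with magnitudes mimicking `area_mean` and `smoothness_mean` (illustrative only, not the real columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales, mimicking area_mean (hundreds to
# thousands) and smoothness_mean (~0.05-0.16); values are illustrative only.
X = np.array([[1001.0, 0.118],
              [1326.0, 0.085],
              [ 386.1, 0.142],
              [1297.0, 0.100]])

X_scaled = StandardScaler().fit_transform(X)

# After standardization each column has mean ~0 and unit variance,
# so no single feature dominates distance-based models such as SVM.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```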
clean_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   radius_mean             569 non-null    float64
 1   texture_mean            569 non-null    float64
 2   perimeter_mean          569 non-null    float64
 3   area_mean               569 non-null    float64
 4   smoothness_mean         569 non-null    float64
 5   compactness_mean        569 non-null    float64
 6   concavity_mean          569 non-null    float64
 7   concave_points_mean     569 non-null    float64
 8   symmetry_mean           569 non-null    float64
 9   fractal_dimension_mean  569 non-null    float64
 10  radius_se               569 non-null    float64
 11  texture_se              569 non-null    float64
 12  perimeter_se            569 non-null    float64
 13  area_se                 569 non-null    float64
 14  smoothness_se           569 non-null    float64
 15  compactness_se          569 non-null    float64
 16  concavity_se            569 non-null    float64
 17  concave_points_se       569 non-null    float64
 18  symmetry_se             569 non-null    float64
 19  fractal_dimension_se    569 non-null    float64
 20  radius_max              569 non-null    float64
 21  texture_max             569 non-null    float64
 22  perimeter_max           569 non-null    float64
 23  area_max                569 non-null    float64
 24  smoothness_max          569 non-null    float64
 25  compactness_max         569 non-null    float64
 26  concavity_max           569 non-null    float64
 27  concave_points_max      569 non-null    float64
 28  symmetry_max            569 non-null    float64
 29  fractal_dimension_max   569 non-null    float64
 30  Diagnosis               569 non-null    object
dtypes: float64(30), object(1)
memory usage: 137.9+ KB
clean_df.describe()
| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_points_mean | symmetry_mean | fractal_dimension_mean | ... | radius_max | texture_max | perimeter_max | area_max | smoothness_max | compactness_max | concavity_max | concave_points_max | symmetry_max | fractal_dimension_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 16.269190 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | 0.007060 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | 0.049960 | ... | 7.930000 | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 |
| 25% | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | 0.057700 | ... | 13.010000 | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 |
| 50% | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | 0.061540 | ... | 14.970000 | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 |
| 75% | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | 0.066120 | ... | 18.790000 | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 |
| max | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | 0.097440 | ... | 36.040000 | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 |
8 rows × 30 columns
4. Correlation Analysis: Pearson vs. Spearman¶
Method:
- Pearson Correlation: Measures linear relationships.
- Spearman Correlation: Measures monotonic rank relationships (non-linear). Comparing both helps identify if relationships are strictly linear or just trending in the same direction.
Purpose: To detect Multicollinearity—redundant features that increase model complexity without adding information.
Results:
Both metrics show near-perfect correlation ($>0.95$) between Radius, Perimeter, and Area. This confirms these features are geometrically redundant. We should retain only one (e.g., Radius) and drop the others to improve model stability.
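The radius and area pair illustrates why comparing the two coefficients matters: since area grows with the square of radius, the relationship is monotonic but not perfectly linear. A small synthetic sketch (not the WDBC data) showing how the two measures diverge:

```python
import numpy as np
import pandas as pd

# Synthetic radii; area = pi * r^2 is a deterministic, monotonic,
# but non-linear function of radius.
rng = np.random.default_rng(0)
radius = rng.uniform(7, 28, size=200)
df = pd.DataFrame({"radius": radius, "area": np.pi * radius**2})

pearson = df["radius"].corr(df["area"], method="pearson")
spearman = df["radius"].corr(df["area"], method="spearman")

# Spearman is 1 (perfectly monotonic); Pearson sits just below 1
# because the curve bends.
print(round(pearson, 4), round(spearman, 4))
```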
# Multicollinearity
corr_chart = aly.corr(clean_df)
corr_chart.save('../results/images/corr_chart.png')
corr_chart.save('../results/images/corr_chart.svg')
corr_chart
5. Pairwise Separability Analysis¶
Purpose: To visualize 2D decision boundaries. We look for feature combinations where the Benign (Blue) and Malignant (Orange) clusters are clearly distinct with minimal overlap.
Results:
- High Separability: Features related to size (`radius_mean`) and shape complexity (`concavity_mean`) separate the classes well.
- Non-linear patterns: The curved relationship between `area` and `radius` is clearly visible, reinforcing the geometric redundancy found in the correlation analysis.
# Only include the '_mean' features, as they capture most of the information
cols_mean = [c for c in clean_df.columns if '_mean' in c] + ['Diagnosis']
pair_chart = aly.pair(clean_df[cols_mean], color='Diagnosis:N')
pair_chart.save('../results/images/pair_chart.png')
pair_chart.save('../results/images/pair_chart.svg')
pair_chart
6. Distribution Analysis¶
Purpose: To inspect the univariate "shape" of the data. We look for Skewness (asymmetry) and Outliers that could bias linear models.
Results:
- Skewness: Features like `area_se` and `concavity_mean` are heavily right-skewed (long tail to the right). This indicates that a log transformation is required to normalize these distributions.
- Overlap: Texture and Smoothness show high overlap between classes, suggesting they are less informative on their own compared to size features.
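The effect of a log transformation on such a distribution can be sketched on a synthetic right-skewed sample. `np.log1p` is used here because some features (e.g., `concavity_mean`) contain exact zeros, where a plain log is undefined:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for a feature like area_se
# (real values not reproduced here).
rng = np.random.default_rng(42)
x = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=500))

# log1p(x) = log(1 + x): safe at zero, compresses the long right tail.
x_log = np.log1p(x)

# Skewness drops from strongly positive toward roughly symmetric.
print(round(x.skew(), 2), round(x_log.skew(), 2))
```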
dist_chart = aly.dist(clean_df, color='Diagnosis')
dist_chart.save('../results/images/dist_chart.png')
dist_chart.save('../results/images/dist_chart.svg')
dist_chart
EDA Findings¶
- Class Separation:
  - High Separability: Features related to size (`radius`, `perimeter`, `area`) and concavity (`concave_points`, `concavity`) show clear distinction between Benign and Malignant classes (Malignant samples generally have higher values).
  - Low Separability: Texture, Smoothness, and Fractal Dimension show significant overlap, indicating they are weaker individual predictors.
- Distributions:
  - Skewness: Area and Concavity features (both `_mean` and `_se`) are heavily right-skewed.
  - Outliers: Visible in the upper tails of `area_max` and `perimeter_se`.
- Correlations (Multicollinearity):
  - Severe Multicollinearity: `radius`, `perimeter`, and `area` are almost perfectly correlated ($R \approx 1$). This is expected geometrically but redundant for models. `concavity`, `concave_points`, and `compactness` also exhibit very high positive correlation.
Preprocessing Recommendations¶
Based on the above, the following pipeline is suggested:
- Feature Selection / Drop: Remove redundant features to reduce multicollinearity. Keep `radius` and drop `perimeter` and `area`, since all three encode the same size information.
- Transformation: Apply a log transformation to skewed features (e.g., `area`, `concavity`) to normalize their distributions.
- Scaling: Features vary vastly in scale (e.g., `area` > 1000 vs. `smoothness` < 0.2). Use `StandardScaler` to standardize all features to zero mean and unit variance.
- Imputation: None needed (the data is complete).
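These recommendations can be sketched as a scikit-learn preprocessing step. The column groupings below are an illustrative subset of the full feature lists, not the final pipeline used later in this report:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Illustrative groupings based on the EDA findings above.
skewed = ["area_mean", "concavity_mean"]   # log-transform, then scale
other = ["radius_mean", "texture_mean"]    # scale only
redundant = ["perimeter_mean"]             # drop (duplicates radius)

preprocessor = make_column_transformer(
    (make_pipeline(FunctionTransformer(np.log1p), StandardScaler()), skewed),
    (StandardScaler(), other),
    ("drop", redundant),
)

# Tiny demo frame with values in realistic ranges (not the real data).
demo = pd.DataFrame({
    "area_mean": [1001.0, 386.1, 1297.0],
    "concavity_mean": [0.30, 0.24, 0.198],
    "radius_mean": [17.99, 11.42, 20.29],
    "texture_mean": [10.38, 20.38, 14.34],
    "perimeter_mean": [122.8, 77.58, 135.1],
})
out = preprocessor.fit_transform(demo)
print(out.shape)  # perimeter_mean is dropped, leaving 4 output columns
```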
Onto Creating a Classification Model¶
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
X = clean_df.drop('Diagnosis', axis=1)
y = clean_df['Diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
X_train.columns
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_max', 'texture_max', 'perimeter_max',
'area_max', 'smoothness_max', 'compactness_max', 'concavity_max',
'concave_points_max', 'symmetry_max', 'fractal_dimension_max'],
dtype='object')
numeric_feats = ['radius_mean', 'texture_mean',
'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_max', 'texture_max',
'smoothness_max', 'compactness_max', 'concavity_max',
'concave_points_max', 'symmetry_max', 'fractal_dimension_max']
drop_feats = [
'perimeter_mean',
'area_mean',
'perimeter_se',
'area_se',
'texture_se',
'smoothness_se',
'symmetry_se',
'perimeter_max',
'area_max'
]
# Note: texture_se, smoothness_se, and symmetry_se also appear in
# numeric_feats above, so the ColumnTransformer still emits their scaled
# copies; listing them here does not actually remove them from the output.
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
ct = make_column_transformer(
(StandardScaler(), numeric_feats),
("drop", drop_feats)
)
pipe = Pipeline([
("preprocess", ct),
("svc", SVC())
])
param_grid = {
"svc__gamma": [0.001, 0.01, 0.1, 1.0, 10, 100],
"svc__C": [0.001, 0.01, 0.1, 1.0, 10, 100]
}
gs = GridSearchCV(
estimator = pipe,
param_grid = param_grid,
cv = 15,
n_jobs = -1,
return_train_score = True
)
gs.fit(X_train, y_train)
GridSearchCV(cv=15,
estimator=Pipeline(steps=[('preprocess',
ColumnTransformer(transformers=[('standardscaler',
StandardScaler(),
['radius_mean',
'texture_mean',
'smoothness_mean',
'compactness_mean',
'concavity_mean',
'concave_points_mean',
'symmetry_mean',
'fractal_dimension_mean',
'radius_se',
'texture_se',
'smoothness_se',
'compactness_se',
'concavity_se',
'con...
'concavity_max',
'concave_points_max',
'symmetry_max',
'fractal_dimension_max']),
('drop',
'drop',
['perimeter_mean',
'area_mean',
'perimeter_se',
'area_se',
'texture_se',
'smoothness_se',
'symmetry_se',
'perimeter_max',
'area_max'])])),
('svc', SVC())]),
n_jobs=-1,
param_grid={'svc__C': [0.001, 0.01, 0.1, 1.0, 10, 100],
'svc__gamma': [0.001, 0.01, 0.1, 1.0, 10, 100]},
return_train_score=True)
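After fitting, the winning hyper-parameters and their mean cross-validation score are available as `best_params_` and `best_score_` on the search object. A self-contained illustration on synthetic data (a much smaller grid than the one above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Tiny synthetic stand-in for the WDBC training set.
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

demo_pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
demo_gs = GridSearchCV(demo_pipe, {"svc__C": [0.1, 1.0, 10]}, cv=5)
demo_gs.fit(X_demo, y_demo)

# best_params_ holds the winning combination; best_score_ its mean CV accuracy.
print(demo_gs.best_params_, round(demo_gs.best_score_, 3))
```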
results = pd.DataFrame(gs.cv_results_)
best_performing = results[['param_svc__C', 'param_svc__gamma', 'mean_test_score']].sort_values(
by='mean_test_score', ascending=False
).head(10)
heatmap_data = results[['param_svc__C', 'param_svc__gamma', 'mean_test_score']].copy()
heatmap_data['C'] = heatmap_data['param_svc__C'].astype(str)
heatmap_data['gamma'] = heatmap_data['param_svc__gamma'].astype(str)
heatmap = alt.Chart(heatmap_data).mark_rect().encode(
x = alt.X('gamma:N', title='gamma'),
y = alt.Y('C:N', title='C'),
color = alt.Color('mean_test_score:Q', scale=alt.Scale(scheme='viridis')),
tooltip = ['C', 'gamma', 'mean_test_score']
).properties(
width = 400,
height = 400,
title = 'SVM GridSearchCV Mean Test Scores'
)
best_performing
| | param_svc__C | param_svc__gamma | mean_test_score |
|---|---|---|---|
| 25 | 10 | 0.01 | 0.969176 |
| 31 | 100 | 0.01 | 0.966667 |
| 30 | 100 | 0.001 | 0.960287 |
| 19 | 1.0 | 0.01 | 0.955986 |
| 24 | 10 | 0.001 | 0.955914 |
| 20 | 1.0 | 0.1 | 0.955914 |
| 26 | 10 | 0.1 | 0.953620 |
| 32 | 100 | 0.1 | 0.951470 |
| 18 | 1.0 | 0.001 | 0.931613 |
| 14 | 0.1 | 0.1 | 0.927455 |
heatmap.display()
from sklearn.metrics import classification_report, confusion_matrix
y_pred = gs.predict(X_test)
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose().drop('support', axis = 1).drop(['macro avg', 'weighted avg'])
report_df
| | precision | recall | f1-score |
|---|---|---|---|
| Benign | 0.986486 | 1.000000 | 0.993197 |
| Malignant | 1.000000 | 0.975610 | 0.987654 |
| accuracy | 0.991228 | 0.991228 | 0.991228 |
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index = gs.classes_, columns = gs.classes_)
cm_melted = cm_df.reset_index().melt(id_vars='index')
cm_melted.columns = ['Actual', 'Predicted', 'Count']
heatmap = alt.Chart(cm_melted).mark_rect().encode(
x = alt.X('Predicted:N', title = 'Predicted'),
y = alt.Y('Actual:N', title = 'Actual'),
color = alt.Color('Count:Q', scale = alt.Scale(scheme ='viridis'))
).properties(
width = 400,
height = 400,
title = 'Confusion Matrix Heatmap'
)
text = alt.Chart(cm_melted).mark_text(color = 'white').encode(
x = 'Predicted:N',
y = 'Actual:N',
text = 'Count:Q'
)
heatmap + text